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Abstract 

Motivated by a computer experiment for the design of a rocket booster, this paper 
explores nonstationary modehng methodologies that couple stationary Gaussian 
processes with treed partitioning. Partitioning is a simple but effective method 
for dealing with nonstationarity. The methodological developments and statistical 
computing details which make this approach efficient are described in detail. In 
addition to providing an analysis of the rocket booster simulator, our approach is 
demonstrated to be effective in other arenas. 
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1 Introduction 



Much of modem engineering design is now done through computer modehng, which is 
both faster and more cost-effective than building small-scale models, particularly in the 
earlier stages of design when more scope for changes is desired. As design proceeds, 
increasingly sophisticated simulators may be created. Our work here was motivated by 
a simulator of a proposed rocket booster. NASA relies heavily on simulators for design, 
as wind tunnel experiments are quite expensive and still not fully realistic of the range 
of flight experiences. In particular, one of the highly critical points in time for a rocket 
booster is the moment that it re-enters the atmosphere. Such conditions are difficult 
to recreate in a wind tunnel, and it is obviously impossible to run a standard physical 
experiment. Thus to learn about the behavior of the proposed rocket booster, NASA 
uses computer simulation. 

Simulators can involve large amounts of physical modeling, requiring the numeri- 
cal solution of complex systems of differential equations. The NASA simulator was no 
exception, typically requiring between five and twenty hours for a single run. Thus, 
NASA was interested in creating a statistical model of the simulator itself, an emulator 
or surrogate model, in the terminology of computer modeling. The standard approach in 
the literature for emula tion is to model the simulator output with a sta t ionary smooth 



Gaus sian process (GP) (j Sacks et al 



1989 



Kennedy and O'Hagan 



2001 



Santner et al. 



20031 ). However, this approach proved to be inadequate for the NASA data. In partic- 
ular, we were faced with problems of nonstationarity, heteroscedasticity, and the size of 
the dataset. Thus we introduce here an e xpansion of GPs, bas ed on the idea of Bayesian 



2002 



Denison et al. 



2OO2I ). which is able to address 



partition models (jChipman et al 
these issues. 

GPs are conceptually straightforward, can easily accommodate prior knowledge in 
the form of covariance functions, and can return estimates of predictive confidence, which 
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were desired by NASA. However, we highlight three disadvantages of the standard form 
of a GP which we had to confront on this dataset, and expect to encounter on a wide 
range of other apphcations. First, inference on the GP scales poorly with the number 
of data points, N, typically requiring computing time in 0{N^) for calculating inverses 
of N X N covariance matrices. Second, GP models are usually stationary in that the 
same covariance structure is used throughout the entire input space, which may be too 
strong of a modeling assumption. Third, the estimated predictive error (as opposed to 
the predictive mean value) of a stationary model does not directly depend on the locally 
observed response values. Rather, the predictive error at a point depends only on the 
locations of the nearby observations and on a global measure of error that uses all of 
the discrepancies between observations and predictions without regard for their distance 
from the point of interest (because of the stationarity assumption). (Section 14. 3l provides 
more details, in particular note that Eq. ( fT2l) does not depend on z.) In many real- world 
spatial and stochastic problems, such a uniform modeling of uncertainty will not be 
desirable. Instead, some regions of the space will tend to exhibit larger v ariability than 



others. On the other hand, fu 



1999 



Schmidt and O'Hagan 



ly no nstationary Bayesian GP models (e.g., 



Higdon et al. 



20031 ) can be difficult to fit, and are not computationally 
tractable for more than a relatively small number of datapoints. Further discussion of 
nonstationary models is deferred until the end of Section 13. 2[ 

All of these shortcomings can be addressed by partitioning the input sp ace into re- 



gions , and fitting separate stationary GP models within each region (e.g.. 



Kim et al. 



20051 ). Partitioning provides a relatively straightforward mechanism for creating a non- 
stationary model, and can ameliorate some of the computational demands by fitting 
models to less data. A Bayesian model averaging approach allows for the explicit es- 
timation of predictive uncertainty, which can now vary beyond the constraints of a 
stationary model. Finally, an R package with implementations of all of the models dis- 
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cussed in this paper is available at 

http : / /www . cran . r-pro j ect . org/web/ packages/tgp/ index . html. We note that by 
partitioning, we do not have any theoretical guarantee of continuity in the fitted func- 
tion. However, as we demonstrate in several examples, Bayesian model averaging yields 
mean fitted functions that are typically quite smooth in practice, giving fits that are 
indistinguishable from continuous functions except when the data call for the contrary. 
Indeed the ability to accurately model possible discontinuities is a side benefit of this 
approach. 

The rest of the paper is organized as follows. Section [2] describes the motivating data 
in further detail. Section [3] provides some background material. Section H] combines 
stationary GPs and treed partitioning to create treed GPs, implementing a tractable 
nonstationary model for nonparametric regression. In Section Owe return to the analysis 
of the rocket booster data, and in Section [6] we conclude with some discussion. 



2 The Langley Glide-Back Booster Simulation 

The Langley Glide-Back Booster (LGBB) is a proposed rocket booster under design 
at NASA. Standard rocket boosters are created to be reusable, assisting in the launch 
process and then parachuting back to the Earth after their fuel has been exhausted. 
Their return path is planned so that they fall into the ocean, where they can be recovered 
and reused. The LGBB represents a new direction in booster design, as it would have 
wings and a tail, looking somewhat similar to the space shuttle. The idea is that it 
would gracefully glide back down, rather than plummeting into the ocean. 

The development of the booster is being done prima rily through the use of computer 



simulators. The particular model (IRogers et al. 



20031 ) with which we were involved is 



based on computational fluid dynamics simulators that numerically solve the relevant 
inviscid Euler equations over a mesh of lA million cells. Each run of the Euler solver for 
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a given set of parameters can take 5-20 hours on the NASA computers. The simulator 
is theoretically deterministic, but the solver is typically started with random initial 
conditions and does not always numerically converge. There is an automated check for 
convergence which is mostly accurate, but some runs are marked as accepted despite 
their false convergence, or they converge to a clearly inferior local mode. For those 
runs that fail the automated convergence check, the solver is restarted at a different 
set of randomly chosen initial conditions. Our NASA collaborators have commented 
that input configurations arbitrarily close to one another can fail to achieve the same 
estimated convergence, even after satisfying the same stopping criterion. Thus neither 
simple interpolation of the data nor a Gaussian process model without an error term 
will be adequate, as smoothing will be necessary to reduce the impact of the inaccurate 
runs. 

The simulator models the forces felt by the vehicle at the moment it is re-entering 
the atmosphere. As a free body in space, there are six degrees of freedom, so the six 
relevant forces are lift, drag, pitch, side-force, yaw, and roll. For this project, the interest 
focused just on the lift force, as that is the most important one for keeping a vehicle 
aloft. The inputs to the simulator are the speed of the vehicle at re-entry (measured by 
Mach number), the angle of attack (the alpha angle), and the sideslip angle (the beta 
angle). Thus the primary goal is to model the lift force as a function of speed, alpha, 
and beta. The sideslip angle is quantized in the experiments, so it is run only at six 
particular levels. Speed ranges from Mach to 6, and the angle of attack, alpha, varies 
from negative five to thirty degrees. The simulator was run at 3041 locations, over a 
combination of three hand-designed grids. The first grid was relatively coarse and was 
equally spaced over the whole region of interest. Two successively finer grids on smaller 
regions primarily around Mach one were further run, as the initial run showed that the 
most interesting part of the input space was generally around the sound barrier. This 
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makes sense because the physics in the simulator comes from two completely different 
regimes, a subsonic regime for speeds less than Mach one, and a supersonic regime for 
speeds greater than Mach one. What happens close to and along the boundary is the 
most difficult part of the simulation. 



lift=f(mach,alpha,beta=0,) llft=f(niach,alpha,beta=O.S) llft=f(mach,alpha,beta=1 ) 




lift=f(mach,alpha,beta=2) lift=f(mach,alpha,beta=3) llft=f(mach,alpha,beta=4) 




Figure 1: Interpolation of lift by speed and angle of attack for all sideslip levels. Note 
that for levels 0.5 and 3 (center), Mach ranges only in (1,5) and (1.2,2.2). 

The upper-left plot in Figure [T] shows an interpolation of the simulator output for 
the lift surface as a function of speed and angle of attack, when the sideslip angle is zero. 
The primary feature of this plot is the large ridge which appears at Mach one and larger 
angles of attack. The transition from subsonic to supersonic is a sharp one, and it is not 
clear whether one would want to use a continuous model or to introduce a discontinuity. 
While much of the surface is quite smooth, parts of the surface, particularly around Mach 
one, are less smooth. So it is clear that the standard computer modeling assumption 
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of a stationary process will not work well here. We will need a method that allows for 
a nonstationary formulation, yet that can still produce uncertainty estimates, and is 
computationally feasible to fit on a dataset of this size. 

One other feature of the data that appears in Figure [1] is the issue of numerical 
convergence. In the upper-left, corner of the upper-left plot (high angle of attack, low 
speed), there is a single spike that looks out of place. Our collaborators at NASA 
believe this to be a result of a false convergence by the simulator, so we would want our 
surrogate model to smooth this one point out. This stands in contrast to most computer 
modeling problems, where one wants to interpolate the deterministic simulator without 
smoothing. Here we require smoothing to compensate for problems with the simulator. 

The other plots of Figure [1] show the issue of false convergence more strongly for 
other sideslip settings. In the center plots (levels one-half and three), there are noisy 
depressions in the surface for moderate speed and high angle of attack. Because this 
feature is not seen in the other plots by sideslip angle, one may suspect that this region 
could be showing more numerical instability than signal. Thus, there is a need to 
combine information across the levels in order to smooth out numerical problems with 
the simulator. Note that because no subsonic inputs were sampled for these slices, the 
ridge around Mach one does not appear in these two plots. 

For sideslip levels of one, two, and four the surface again appears to be most inter- 
esting around Mach one. But instead of a clean ridge at levels one and four it is noisy, 
especially at high angles of attack. It is not clear if this variability is due to false con- 
vergence of the simulator, inadequate coverage in the design, or if the boundary really is 
this complicated. The NASA scientists have postulated that the instabilities are more 
likely to be numerical, rather than structural, but we will want our surrogate model to 
capture this uncertainty. 

Also of concern are the deviations from the smooth trend at high speeds (particularly 
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for level four), with upward deviations at low angles of attack and downward deviations 
at high angles. Again, it is suspected that these are the result of false numerical conver- 
gence of the simulator, but we cannot rule out a priori that the physical system itself 
becomes unstable at higher speeds. So we desire smoothing, but with an appropriate 
local estimate of uncertainty. Fitting with a single stationary GP would give uncer- 
tainty estimates that were fairly uniform across the space, because of the assumption of 
stationarity. Thus we turn to a partitioning approach. 

Understanding the mean surface is important for the engineers for several reasons. 
First, they may discover potential problems with the design, which could lead to struc- 
tural changes in the design. Second, they will need to determine the optimal flight 
conditions so they can plan the flight trajectories. Third, they need to be able to make 
contingency plans in case problems arise during a mission. If some of the stabilizing 
rockets fail and the vehicle must re-enter at an unplanned angle or speed, they will need 
to be able to map out its new trajectory and adjust the process as necessary. The en- 
gineers are interested in not just the mean surface, but also the uncertainty associated 
with prediction, because these uncertainties are not constant across the surface. 



3 Related work 



Our approach to nonparametric and semiparametric nonstationary modeling combines 
standard GPs and treed partitioning within the context of Bayesian hierarchical model- 
ing and model averaging. We assume that the reader is fa miliar with the bas ic concepts 
of Bayesian inference via Markov chain Monte Carlo (e.g. 
duction to GPs and treed partition modeling follows. 



GilksetaL 



19961 ). An intro- 
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3.1 Stationary Gaussian Processes 

A common specification of stocliastic processes for spatial data, of which the stationary 
Gaussian process (GP) is a particular case, specifies that model outputs (responses) 
z, depend on multivariate inputs (explanatory variables) x, as 2;(x) = /3^f(x) + ty(x) 
where (5 are linear trend coefficients, u;(x) is a zero mean random process with covariance 
C(x, x') = cr^A'(x, x'), K is a correlation matrix, and (3 is independent of w in the prior. 
Low-order polynomials are sometimes used instead of the simple linear mean /3^f (x), or 
the mean process is specified generically, often as ^(x, /3) or ^(x). 

Formally, a Gaussian process is a collection of random variables Z( x) ind e xed b y x. 



having a jointly Gaussian distribution for any finite subset of indices ( ISteid . Il999l ). It 
is specified by a mean function yu(x) = £'(Z(x)) and a correlation function /('(x, x') = 
^£'([Z(x) — /i(x)][Z'(x') — /i(x')]'''). We assume that the correlation function can be 
written in the form 

A'(xj, Xfcl^f) = K*{^j, Xfc) + g5j^k- (1) 

where 5.^. is the Kronecker delta function and K* is a proper underlying parametric 
correlation function. The g term in Eq. ([1]) is referred to as the nugget. It must always 
be positive {g > 0), and it serves two purposes. First, it provides a mechanism for 
introducing measurement error into the stochastic process. It arises when considering a 
model of the form Z(x) = ,^(x, /3) + u;(x) + ?7(x), where !/;(■) is a process with correlations 
governed by K*, and ?7(-) is simply Gaussian noise. Second, the nugget helps prevent K 
from becoming numerically singular. Notational convenience and conceptual congruence 
motivates referral to K as a correlation matrix, even though the nugget term {g) forces 
A'(xj,Xi) > 1. There is an isomorphic model specification wherein K depicts proper 
correlations. Under both specifications K* does indeed define a valid correlation matrix 
K*. 

The correlation functions K*{-, ■) are typically specified through a low dimensional 
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parametric structure, which also guarantees that they are symmetric and positive semi- 
definite. Here we focus on the power family, although our m ethods are clearly extensible 



Abrahamsen 



to other families, such as the M atern class (IMatern 
lation structures can be found in 



(119971 ) or 



19861). F u rther discussion of corre- 



Steinl (I1999I ). The power family 



of correlation functions includes the simple isotropic parameterization 



K*(xj,Xfc|(i) = exp 



d 



(2) 



where d > is a single range parameter and po G (0, 2] determines the smoothness of 
the process. Thus the correlation of two points depends only on the Euclidean distance 
||xj — Xfcll between them. A straightforward enhancement to the isotropic power family 
is to employ a separate range parameter di in each dimension {i = l,...,mx). The 
resulting correlation function is still stationary, but no longer isotropic: 



mx 



JC(xj,Xfc|d) = exp - 



PO 



i=l 



di 



(3) 



3.2 Treed Partitioning 

Many spatial modeling problems require more fiexibility than is offered by a stationary 
GP. One way to achieve a more fiexible, nonstationary, process is to use a partition 
model — a meta-model which divides up the input space and fits different base models 
to data independently in each region. Treed partitioning is one possible approach. 

Treed partition models typically divide up the input space by making binary splits 
on the value of a single variable (e.g., xi > 0.8) so that partition boundaries are parallel 
to coordinate axes. Partitioning is recursive, so each new partition is a sub-partition of 
a previous one. For example, a first partition may divide the space in half by whether 
the first variable is above or below its midpoint. The second partition will then divide 
only the space below (or above) the midpoint of the first variable, so that there are now 
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three partitions (not four). Since variables may be revisited, there is no loss of generality 
by using binary splits, as multiple splits on the same variable will be equivalent to a 
non-binary split. In each partition (leaf of th e tree), an independen t model is applied. 



19841 ) are an example of 



Classification and Regression Trees (CART) fiBreiman et al 
a treed partition model. CART, which fits a constant surface in each leaf, has become 
popular because of its ease of use, clear interpretation, and ability to provide a good fit 
in many cases. 



The Bayesian ap proach is straightforward to apply to CART (jChipman et al 



1998 



Denison et al. 



19981 ). provided that one can specify a meaningful prior for the size of the 
tree. We follow Chipman et al. (1998) who specify the prior through a tree-generating 
process and enforce a minimum amount of data in order to infer the parameters in each 
partition. Starting with a null tree (all data in a single region), a leaf node ?7 G T, 
representing a region of the input space, splits with probability a(l -|- qri)~^, where 
is the depth oi r] E T and a and b are parameters chosen to give an appropriate size 
and spread to the distribution of trees. Further details are available in the Chipman 
et al. papers. For our models, we have found that default values of a = 0.5 and b = 2 
often work well in practice, although in any particular problem prior knowledge may call 
for other values. The prior for the splitting process involves first choosing the splitting 
dimension u from a discrete uniform, and then the split location s is chosen uniformly 
from a subset of the locations X in the dimension. Integrating out dependence on 
the tree structure T can be accomplished via Reversible- Jump (RJ) MCMC as further 
described in Section 14.2.21 

Chipman et al. (2002) generalize Bayesian CART to create the Bayesian treed linear 
model (LM) by fitting hierarchical LMs at the leaves of the tree. In Section H] we 
generalize further by proposing to fit station ary CPs iii each of the leaves of the tree. 



This approach bears some similarity to that of 



Kim et al 



(120051 ). who fit separate CPs in 
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each element of a Voronoi tessellation. The treed GP approach is better geared toward 
problems with a smaller number of distinct partitions, leading to a simpler overall model. 
Voronoi tessellations allow an intricate partitioning of the space, but have the trade-off 
of added complexity and can produce a final model that is difficult to interpret. The 
tessellation approach also has the benefit of not being restricted to axis-aligned partitions 
(although in some cases, simple transformations such as rotating the data will suffice to 
all ow axis-aligned partitions). A nice review of Bayesian partition modeling is provided 



by 



Denison et al. 



fl2002h 



Other approaches to nonstationary modeling include those which use spatial defor- 
mations and process convolutions. The idea behind the spatial deformation approach is 
to map nonstationary inputs in the original, geographical, s pace i nto a dispersion space 



wherein the process is stationary. 



Sampson and GuttorpI (119921 ) use th in-plate sphne 



mode 



s and multidimensional scaling (MDS) to construct the m apping. 



( 2001 ) explore a similar methodology from a Bayesian perspective. 



Damian et al. 



Schmidt and O'Hagan 



(120031) also take the Bayesian approach, but put a Gau s sian 



ping. The process convolution approach (IHigdon et al. 



j rocess p r ior on the map- 



1999; 



Fuentes 



2002 



Paciorek 



20031 ) proceeds by allowing the convolution kernels to vary smoothly in parameterization 
as an unknown function of their spatial location. A common theme among such nonsta- 
tionary models is the introduction of meta-structure which ratchets up the flexibility of 
the model, ratcheting up the computational demands as well. This is in stark contrast 
to the treed approach that introduces a structural mechanism, the tree T, that actually 
reduces the computational burden relative to the base model, e.g., a GP, because smaller 
correlation matrices are inverted. A key difference is that these alternative approaches 
strictly enforce continuity of the process, which requires much more effort than the treed 
approach. 
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4 Treed Gaussian process models 

Extending the partitioning ideas of Chipman et al. (1998, 2002) for simple Bayesian 
treed models, we fit stationary GP models with linear trends independently within each 
of R regions, {r^}^^-^^, depicted at the leaves of the tree T, instead of constant (1998) 
or linear (2002) r nodels. The tree is averaged out by integrating over possible trees. 



using RJ-MCMC ( iRichardson and Green 



19971 ). with the tree prior specified through a 



tree-generating process. Prediction is conditioned on the tree structure, and is averaged 
over in the posterior to get a full accounting of uncertainty. 

4.1 Hierarchical Model 

A tree T recursively partitions the input space into into R non-overlapping regions: 
{^u}u=i- Each region contains data = {X,^,Z,^}, consisting of n^, observations. 
Let m = mx + 1 be number of covariates in the design (input) matrix X plus an intercept. 
For each region r^, the hierarchical generative GP model is 

Z,|/3„ al ~ NnAFuf3,, a^K,), (3^ ~ iV^(/^, B) 

/3Ja^,r2, W,/3o ~ iV™(/3o, t^'r'W) ~ JG(a,/2, g,/2), (4) 

al ~ IG{aj2,qj2), W'^ ~ W{{pV)-\p), 

with F,^ = (1, Xj,), and W is an m x m matrix. The A^, IG, and W are the (Multivari- 
ate) Normal, Inverse- Gamma, and Wishart distributions, respectively. Hyperparameters 
l-i,'B,Y , p,aa-,qa,C(T,QT are treated as known. The model (jlj) specifies a multivariate 
normal likelihood with linear trend coefficients /3^, variance a^, and x correlation 
matrix K.^. The coefficients are believed to have come from a common unknown 
mean /3q and region-specific variance a^T^. There is no explicit mechanism in the model 
(jl]) to ensure that the process near the boundary of two adjacent regions is continu- 
ous across the partitions depicted by T. However, the model can capture smoothness 
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through model averaging, as will be discussed in Section 14.31 In our work with models 
for physical processes, we frequently encounter problems with phase transitions where 
the response surface is not smooth at the boundary between distinct physical regimes, 
such as subsonic vs. supersonic flight of the rocket booster, so we view the ability to flt 
a discontinuous surface ClS db feature of this model. 

The GP correlation structure J'r^(xj,Xfc) = ii'*(xj,Xfc) + g^Sj^k generating Kj, for 
each partition r^, takes to be from the isotropic power family ([2]), or separable power 
family ([3]), with a flxed power po? but unknown (random) range and nugget parameters. 
However, since most of the following discussion holds for generated by other families, 
as well as for unknown po? "we shall refer to the correlation parameters indirectly via the 
resulting correlation matrix K, or function i^(-,-). For example, p(K^) can represent 
either of p{d^,gi,) or p{d^,g^), etc. Priors that encode a preference for a model with a 
nonstationary global covariance structure are chosen for parameters to K* and g,y. In 
particular, we propose a mixture of Gammas prior for d: 

p{d, g) = p{d) X p{g) = p{g) x ^[G{d\a = 1, /? = 20) + G{d\a = 10, /? = 10)]. (5) 

It gives roughly equal mass to small d representing a population of GP parameteriza- 
tions for wavy surfaces, and a separate population for those which are quite smooth or 
approximately linear. We take the prior for g to be Exp(A). Alternativ ely, one could 



encode the prior as p{d,g) = p{d\g)p{g) and then use a reference prior (IBerger et al. 



200ll ) for p{d\g). We prefer the more deliberate mixture speciflcation both because of its 



modeling implications, as well as its ability to interface well with limiting linear models 



(IGramacy and Lee 



20081 ) 



It may also be sensible to deflne the prior for {K,cr^,r^}j, hierarchically, depending 
on parameters 7 (not indexed by u), similar to how the population of [3,, parameters is 
given a common prior in terms of W and /3o in (jlj). 
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4.2 Estimation 

The data = {X, Z},^ are used to update the GP parameters 0^ = cr^, K, r^}^, 
for u = 1,. . . ,R. Conditional on the tree T, the full set of parameters is denoted as 
e = 6oU U^Li^i^, where = {W,/3o,7} denotes upper-level parameters from the 
hierarchical prior that are also updated. Samples from the posterior distribution of 6 
are gathered using Markov chain Monte Carlo (MCMC) by first conditioning on the 
hierarchical prior parameters 0q and drawing 0,^\0q for z/i, . . . , z//j, and then is drawn 
as ^olUf=i^i^- Section [4.2.11 gives the details. All parameters can be sampled with 
Gibbs steps, except those that parameterize the covariance function K{-, ■), e.g., {d^g}^, 
which require Metropolis-Hastings (MH) draws. Section 14.2.21 shows how RJ-MCMC is 
used to gather samples from the joint posterior of {0,T) by alternately drawing 6\T 
and T\6. 

4.2.1 GP parameters given a tree (T) 

Conditional conjugacy allows Gibbs sampli ng for mo st para meters. Full derivations 



Gramacyl (120051 ). The linear regression 



of the following equations are available in 
parameters and prior mean /3q both have multivariate normal full conditionals: 
/3y|rest ~ iVm(3r/,cr^V^J, and /3o|rest ~ Nm{^o,V^^), where 

V^^ = (FjK;iF,+W-Vr,2)-\ = V^^(F:K;^Z,+W-i/3o/r2), (6) 

The linear variance parameter follows an inverse-gamma: 
r^lrest ~ + m)/2, (q, + K)/2), where K = if3, - f3,yw~\f3, - (3^)/al. 



The linear model covariance matrix W follows an inverse- Wishart: 

R 



R ^ 

W-i|rest ~ {{pV+Yy,)-\p + R) , where = Y,l ^(/3,-/3o)(/3.-/3o) 
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Analytically integrating out /3,, and a?, gives a m arginal posterior for and improves 



mixing of the Markov chain (IBerger et al. 



200 ih . 



where V. = ZJK^^Z, + /3([W-i/3o/r' - (8) 

Eq. ([7]) can be used to iteratively obtain draws for the parameters of K^{-, ■) via Metropolis- 
Hastings (MH), or as part of the acceptance ratio for proposed modifications to T [see 
Section 14.2.2] . Any hyperparameters to Ki^{-, ■), e.g., parameters to priors for {d,g}iy of 
the isotropic power family, would also require MH draws. The conditional distribution 
of al with /3j, integrated out allows Gibbs sampling: 

al\Z,, d„ g, l3o, W ~ IG{{a^ + n,)/2, (g, + ^,)/2). (9) 

4.2.2 Tree (T) 

Integrating out dependence on the tree structure (T) is accomplished by RJ-MCMC. We 



Chipman et al. 



(119981 ) — grow, prune, change, swap — with 



augment the tree operations of 
a rotate operation. 

A change operation proposes moving an existing split-point {u, s} to either the next 
greater or lesser value of s (s+ or s_) along the u"' column of X. This is accomplished 
by sampling s' uniformly from the set {u^, Sy}l^^i^ x {+, — } which causes the MH 
acceptance ratio for change to reduce to a simple likelihood ratio since parameters 6r in 
regions r below the split-point {u, s'} are held fixed. 

A swap operation proposes changing the order in which two adjacent parent-child 
(internal) nodes split up the inputs. An internal parent-child node pair is picked at ran- 
dom from the tree and their splitting rules are swapped. However, swaps on parent-child 
internal nodes which split on the same variable cause problems because a child region 
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below both parents becomes empty after the operation. If instead a rotate operation 
from Binary Search Trees (BSTs) is performed, the proposal will almost always accept. 
Rotations are a way of adjusting the configuration a nd height of a BST w ithout violating 



the BST property, as used, e.g., by red-black trees (I Gormen et al 



1990) 



{1,5} 



{1,3} 




Figure 2: Rotating on the same variable, where 7^, 7^,7^ are arbitrary sub-trees . 

In the context of a Bayesian MCMC tree proposal, rotations encourage better mixing 
of the Markov chain by providing a more dynamic set of candidate nodes for pruning, 
thereby helping escape local minima in the marginal posterior of T. Figure [2] shows 
an example of a successful right-rotation where a swap would produce an empty node 
(at the current location of 7^). Since the partitions at the leaves remain unchanged, 
the likelihood ratio of a proposed rotate is always 1. The only "active" part of the MH 
acceptance ratio is the prior on T, preferring trees of minimal depth. Still, calculating 
the acceptance ratio for a rotate is non-trivial because the depth of two of its sub-trees 
change. Figure [2] shows a right-rotate, where nodes in 7i decrease in depth, while those 
in increase. The opposite is true for left-rotation. If / = {/j,/^} is the set of nodes 
(internals and leaves) of Ti and T^, before rotation, which increase in depth after rotation, 
and D = {Di, Dg} are those which decrease in depth, then the MH acceptance ratio for 
a right-rotate is 
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p{T) p{T,)p{T,) n,6/. «(i + n,e/. [1 - «(i + 

The MH acceptance ratio for a left-rotate is analogous. 

Grow and prune operations are complex because they add or remove partitions, 
changing the dimension of the parameter space. The first step for either operation is to 
uniformly select a leaf node (for grow), or the parent of a pair of leaf nodes (for prune). 
When a new region r is added, new parameters {/'^(■, ■), r^jr must be proposed, and 
when a region is taken away the parameters must be absorbed by the parent region, 
or discarded. When evaluating the MH acceptance ratio the linear model parameters 
{/3, cr^jr are integrated out (I7j). One of the newly grown children is uniformly chosen to 
receive the correlation function K{-, ■) of its parent, essentially inheriting a block from 
its parent's correlation matrix. To ensure that the resulting Markov chain is ergodic 
and reversible, the other new sibling draws its K{-, ■) from the prior thus giving a unity 
Jacobian term in the RJ-MCMC. Note that grow operations are the only place where 
priors are used as proposals; random- walk proposals are used elsewhere [see Section 1^^ . 

Symmetrically, prune operations randomly select parameters from K{-, ■) for the 
consolidated node from one of the children being absorbed. After accepting a grow or 
prune, can be drawn from its marginal posterior, with /^^ integrated out ([9]), followed 
by draws for (3^. and the rest of the parameters in the r**" region. 

Let {X, Z} be the data at the new parent node rj at depth g^, and {Xi,Zi} and 
{X2,Z2} be the partitioned child data at depth g,? + 1 created by a new split {m, s}. 
Also, let V be the set of prunable nodes of T, and Q the set of growable nodes. If V 
are the prunable nodes in T' — after the (successful) grow at rj — and the parent of rj was 
prunable in T, then \'P'\ = \V\. Otherwise \V'\ = \V\ + 1. The MH ratio for grow is: 
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16^1 


a{l + q,)-\l-a{2 + qr,)-'')^p(K^, 


Zi,/3o,r2,WMK2 


Z2,/3o,r|,W) 


\V' 


l-ail + qri)-^ 


p(K|Z,/3o,r2,W)g(K2) 



(10) 



assuming that Ki was randomly chosen to receive the parameterization of its parent, K, 
and that the new parameters for K2 are proposed according to q. The prune operation 
is analogous. Note that for the posteriors p(K|Z, (3q,t'^, W), 

p(Ki|Zi, /3o, r^, W) and p(K2|Z2, /3q, r|, W), the "constant" terms in ([7]) are required 
because they do not occur the same number of times in the numerator and denominator. 

4.3 Treed GP Prediction 



199J). Con- 



Prediction under the above GP model is straightforward (iHjort and Omrd . 
ditional on the covariance structure, the predicted value of ^(x G r,y) is normally dis- 
tributed with mean and variance 

^(x) = ^(Z(x)l data,x G r,) = ^(x)^, + k,(x)^K;i(Z, - FJ,), (11) 
(t(x)^ = Var(z(x)| data,x G r^) = o-^[k(x,x) - qJ(x)C~^q,,(x)], (12) 

where C;^ = (K, + rXWFj)-^ q,(x) = k,(x) + r^Wf (x) (13) 

k{^, y) = i^.(x, y) + r^r (x)Wf (y) 

with f'''(x) = (1, x"""), and k^(x) is a rij,— vector with k,^j(x) = K^{'x, Xj), for all Xj G X^. 

Conditional on a particular tree, T, the posterior predictive surface described in 
Eqs. (fTT] - [T2|) is discontinuous across the partition boundaries of T. However, in the 
aggregate of samples collected from the joint posterior distribution of (T, 6), the mean 
tends to smooth out near likely partition boundaries as the tree operations grow, prune, 
change, and swap integrate over trees and CPs with larger posterior probability. Uncer- 
tainty in the posterior for T translates into higher posterior predictive uncertainty near 
region boundaries. When the data actually indicate a non-smooth process, the treed GP 
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retains the flexibility necessary to model discontinuities. When the data are consistent 
with a continuous process, as in the motorcycle data example in Section 14.51 the treed 
GP fits are almost indistinguishable from continuous. 

4.4 Implementation 

The treed GP model is coded in a mixture of C and C++, using C++ for the tree struc- 
ture and C for the GP at each leaf of T. The C code can interface with either stan- 
dard platform-specific Fortran BLAS/Lapack libraries for the necessary linear algebra 



routines, or link to those auto matically configured 



platforms via the ATLAS library (jWhaley and Petitet 



' or fas t execution on a variety of 



20041 ). To improve usability, the 



routines have been wrapped up in an intuitive R interface, and are available on GRAN 



( 1R Development Gore Team 



2004J ) at 



http: //www. cran. r-project . org/web/packages/tgp/ index .html 

as a package called tgp. 

It is useful to first translate and re-scale the input dataset (X) so that it lies in an 
sjfjmx dimensional unit cube. This makes it easier to construct prior distributions for the 
width parameters to the correlation function K{-, ■). Gonditioning on T, proposals for 
all parameters which require MH sampling are taken from a uniform "sliding window" 
centered around the location of the last accepted setting. For example, a proposed a 
new nugget parameter g^, to the correlation function K{-, ■) in region would go as 
gl ~ Unif (3(7j,/4, 4(7^/3). Galculating the forward and backward proposal probabilities 
for the MH acceptance ratio is straightforward. 

After conditioning on {T,6), prediction can be parallelized by using a producer- 
consumer model. This allows the use of PThreads in order to take advantage of multiple 
processors, and get speed-ups of at least a factor of two, which is helpful as multi- 
processor machines become commonplace. Parallel sampling of the posterior of for 
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each of the {6^}^^i is also possible. 



4.5 Illustration 

In this section w e illustrate the treed GP model on the Moto rcycle Accident Dataset 



( Silverman 



Rasmussen and Ghahramani 



19851 ). a classic dataset used in recent literature (e.g. 
2002h to demonstrate the success of nonstationary models. The dataset consists of mea- 
surements of the acceleration of the head of a motorcycle rider, which we attempt to 
predict as a function of time in the first moments after an impact. In addition to suggest- 
ing a model with a nonstationary covariance structure, there is input-dependent noise 
(a.k.a., heteroscedasticity). To keep things simple in this illustration, the isotropic power 
family ([2]) correlation function [po = 2) is chosen for K*[-, - Id) with range parameter d, 
combined with nugget g to form K{-, -Irf, (7). 



GP, accel mean and error 



treed GP LLM, accel mean and error 
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Figure 3: Motorcycle Dataset, fit {left) by a stationary process and right by our nonsta- 
tionary model. 

Figure [3] shows the data and the fits given by both a stationary GP (left) and the 
treed GP model (right), along with 90% credible intervals. For the treed GP, vertical 
lines illustrate a typical treed partition T. Notice that the stationary GP is completely 
unable to capture the heteroscedasticity, and that the large variability in the central 
region drives both ends to be more wiggly (in particular, the transition from the flat left 
initial region requires an upward curve before descending). In contrast, the treed GP 
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clearly reflects the differing levels of uncertainty, as well as allowing a flatter fit to the 
initial segment and a smoother fit to the final segment. 20,000 MCMC rounds yielded 
an average of 3.11 partitions in T. 

These results differ from those of Rasmussen & Ghahramani (2002). In particular, 
the error-bars they report for the leftmost region seem too large relative to the other 
regions. They use what they call an "infinite mixture of GP experts" which is a Dirichlet 
process mixture of GPs. They report that the posterior distribution uses between 3 and 
10 experts to fit this dataset, with even 10-15 experts still having considerable posterior 
mass, although there are "roughly" three regions. Contrast this with the treed GP 
model which almost always partitions into three regions, occasionally four, rarely two. 
On speed grounds, the treed GP is also a winner. Running the mixture of GP experts 
model using a total of 11,000 MCMC rounds, discarding the first 1,000, took roughly one 
hour on a 1 GHz Pentium. Allowing treed GP to use 25,000 MCMC rounds, discarding 
the first 5,000, takes about 3 minutes on a 1.8 GHz Athalon. 

We note that the mean fitted function in the right plot in Figure [3] is essentially 
that of a continuous function. Figure H] shows examples of the fits from individual 
MCMC iterations that are eventually averaged. While the individual partition models 
are typically discontinuous, it is clear from Figure [3] that the mean fitted function is 
well-behaved. 

4.6 Limiting linear models 

In some cases, a GP may not be needed within a partition, and a much simpler model, 
such as a linear model, may suffice. In particular, because of the linear mean function in 
our implementation of the GP, the standard linear model can be seen as a limiting case. 
The linear model is then more parsimonious, as well as much more computationally 
efficient. Use of a model-switching prior allows practical implementation. More details 
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Figure 4: Fits of the motorcycle accident data from individual MCMC iterations. 



are available in iGramacy and Led (l2008l ) . The value of such an approach can be seen 
from the fit on the right side of Figure [31 The leftmost partition looks quite fiat, 
and so could be fit just as well with a line rather than a GP. The center partition 
clearly requires a GP fit. The rightmost partition looks mostly linear, and would give a 
posterior which is a mix of a GP and linear model. Indeed, Figure H] shows examples of 
the individual fits, and the leftmost section is nearly always flat, the rightmost section is 
often but not always fiat, and the center section is typically curved but even there it can 
be essentially piecewise linear (the range parameter d is estimated to be large, giving a 
nearly linear fit). Replacing the full GP with a linear model in a partition greatly reduces 
the computational resources required to update the model in that partition. Treed and 
non-treed Gaussian process with jumps to the limiting linear model are implemented in 
the tgp package on GRAN, and we take advantage of the full formulation in our analyses 
herein. 
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5 Rocket Booster Model Results 

We fit our treed GP model to the rocket booster data using ten independent RJ-MCMC 
chains with 15,000 MCMC rounds each. The first 5,000 were discarded as burn-in, and 
every tenth thereafter was treated as a sample from the posterior distribution 7r(T, 6\Z). 
In total, 10,000 samples were saved. On a single 3.2 GHz Xeon processor this took 



lift mean, witli (beta) fixed to (0) lift mean, with (beta) fixed to (0.5) lift mean, with (beta) fixed to (1) 




Figure 5: Posterior predictive mean surfaces of lift for all sideslip angles. Note that for 
levels 0.5 and 3 (center), Mach ranges only in (1,5) and (1.2,2.2) 

about 60 hours. On the same machine, using the same (tuned) linear algebra libraries, 
inverting a single 3041 x 3041 matrix takes about 17 seconds, so obtaining the same 
number of samples from a stationary GP would have taken a minimum of 708 hours. 
This is a gross underestimate because it assumes only one inverse is needed per MCMC 
round. Moreover, it does not count any of the O(n^) operations like determinants of K 
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Figure 6: Posterior predictive variance surfaces of lift for all six sideslip angles. Dots 
show the locations of experimental runs. Darker shades are higher values. 

(assuming a factorized K was saved in computing K~^) or multiplications like ZK~^Z 
in (IH]), nor does it factor in the time needed to sample from the posterior predictive 
distribution. 

Figures [5] & [6] summarize the posterior predictive distribution for the lift response 
for each of the six levels of sideslip angle. Figure [5] contains plots of the fitted mean 
lift surface by speed and angle of attack, and Figure [6] plots a measure of the estimated 
predictive uncertainty given by the difference in 95% and 5% quantiles of samples from 
the posterior predictive distribution. The treed GP works well here. Most of the space 
is nicely smooth, with the sharp transition at Mach one also well-modeled. Most of 
the potential false convergences have been smoothed out. But the estimated variability 
reflects both increased variability where the function is changing rapidly (e.g., near 
Mach one, particularly for higher sideslip levels) and especially where there are issues 
of possible false numerical convergence. Note that the uncertainty is not that high near 
Mach one at sideslip level zero because of the large number of samples taken in that 
region. We also note that the increased uncertainty seen in the top rows around Mach 
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three and higher angles of attack is due to the noisy depression area in the data for 
sideshp level of one-half. Figure [7] shows the MAP treed partitions T found during 



lift: treed partitions (beta=0) 
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Figure 7: MAP treed partitions T for the lift response at sideslip level zero. 

MCMC for the slice of sideslip level zero. Notice the aggressive partitioning near Mach 
one due to the regime shift between subsonic and supersonic speeds. Extra partitioning 
at low speeds and large angles of attack address the singularity outlined in Figure [H and 
near Mach three due to the numerical instabilities at sideslip level one-half. 



Lift Mean & Quantiles: (Alpha,Beta)=(25,0) 
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Figure 8: A slice of the mean fit with error bars as a function of Mach with alpha fixed 
to 25 and beta fixed to zero. 

Figure [S] shows the mean fit and a 90% credible interval for once slice of predicting 
lift, with Mach on the x-axis and considering only angle of attack equal to 25 and 
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slideslip angle equal to (i.e., this is one slice from the upper left plot in Figure [5l this 
plot is from fitting the whole dataset, but we only plot one slice for visibility). The key 
item to note is that the fit is essentially continuous. The plot is made up only of points 
at fitted values, no interpolation or lines have been used. In contrast. Figure [9] shows 
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Figure 9: Slices of the posterior predictive mean from individual MCMC iterations for 
the LGBB data with alpha fixed to 25 and beta fixed to zero. 

examples of the treed GP fits from individual MCMC iterations, which often have clear 
discontinuities from the partitioning structure. Thus as is typical, our mean fitted values 
are quite smooth, because they are an average, even though the individual components 
of the mean may not be continuous. 

To measure goodness of fit we typically rely on qualitative visual barometers. For 
example, traces are used to asses mixing in the Markov chain, and posterior predictive 
slices and projections are inspected, as descri b ed ab ove. For a more quantitative as- 
sessment we follow the suggestion of iGelfandl (119951 ) and use 10-fold cross validation. 



Posterior predictive quantiles are obtained for the input locations held-out of each fold, 
and the proportion of held-out responses that fall within the 90% predictive interval is 
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recorded. For the LGBB data we found a proportion of 0.96 using the treed GP LLM 
model. Thus our model fits well, and if anything, our predictive intervals are slightly 
wider than necessary, so we appear to be fully accounting for uncertainty. 



6 Conclusion 

We developed the treed Gaussian process model for the rocket booster computer ex- 
periment, but it also has a wide range of uses as a simple and efficient method for 
nonstationary modeling. A fully Bayesian treatment of the treed GP model was laid 
out, treating the hierarchical parameterization of the correlation function K{-,-) as a 
modular component, easily replaced by a different family of correlations. The limiting 
linear model parameterization of the GP can be both useful and accessible in terms 
of Bayesian posterior estimation and prediction, resulting in a uniquely nonstationary, 
semiparametric, tractable, and highly accurate model that contains the Bayesian treed 
linear model special case. 



We believe that a large contribution o 



" the treed G 



quential design of computer experiments (jSantner et al. 



will be in the domain of se- 



2003 



Gramacy et al. 



2004h . 



Empirical evidence suggests that many computer experiments contain much linearity, 
as we have seen with large regions of the space for the rocket booster simulator. The 
Bayesian treed GP provides a full posterior predictive distribution (particularly a non- 
stationary and thus region-specific estimate of predictive variance) which can be used 
towards active learning in the input domain. Exploitation of these characteristics should 
lead to an efficient framework for the adaptive exploration of computer experiment pa- 
rameter spaces. 
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